Abstract

We show that if F is a convex class of functions that is L-subgaussian, 
the error rate of learning problems generated by independent noise is 
equivalent to a fixed point determined by ‘local’ covering estimates of 
the class, rather than by the gaussian averages. To that end, we establish
new sharp upper and lower estimates on the error rate for such
problems. 


1 Introduction 

The focus of this article is on the question of prediction. Given a class of
functions $F$ defined on a probability space $(\Omega, \mu)$ and an unknown target
random variable $Y$, one would like to identify an element of $F$ whose
‘predictive capabilities’ are (almost) the best possible in the class. The notion
of ‘best’ is measured via the point-wise cost of predicting $f(x)$ instead of $y$,
and the best function in the class is the one that minimizes the average cost.
Here, we will consider the squared loss: the cost of predicting $f(x)$ rather
than $y$ is $(f(x) - y)^2$, and if $X$ is distributed according to $\mu$, the goal is to
identify

$$f^* = \mathrm{argmin}_{f \in F} \, \mathbb{E}(f(X) - Y)^2 = \mathrm{argmin}_{f \in F} \|f - Y\|_{L_2}^2,$$

where the expectation is taken with respect to the joint distribution of $X$
and $Y$ on the product space $\Omega \times \mathbb{R}$.

Department of Mathematics, Technion, I.I.T, Haifa 32000, Israel
email: shahar@tx.technion.ac.il

Supported in part by the Mathematical Sciences Institute, The Australian National
University, Canberra, ACT 2601, Australia. Additional support was given by the Israel
Science Foundation grant 900/10.





The information at one’s disposal is rather limited: a random sample
$(X_i, Y_i)_{i=1}^N$, selected according to the $N$-product of the joint distribution of
$X$ and $Y$. Using this data, one must select some (random) $f \in F$.

Definition 1.1 Given a sample size $N$ and a class $F$ defined on $(\Omega, \mu)$, a
learning procedure is a map $\Psi : (\Omega \times \mathbb{R})^N \to F$. For a set $\mathcal{Y}$ of admissible
targets, $\Psi$ performs with confidence $1 - \delta$ and accuracy $\mathcal{E}_p$ if for every $Y \in \mathcal{Y}$,
and setting $\tilde{f} = \Psi((X_i, Y_i)_{i=1}^N)$,

$$\mathbb{E}\big( (\tilde{f}(X) - Y)^2 \,\big|\, (X_i, Y_i)_{i=1}^N \big) \le \mathbb{E}(f^*(X) - Y)^2 + \mathcal{E}_p$$

with probability at least $1 - \delta$ relative to the $N$-product of the joint distribution
of $X$ and $Y$.

The accuracy (or error) $\mathcal{E}_p$ is a function of $F$, $N$ and $\delta$, and may depend
on some features of the target $Y$ as well, for example, its norm in some $L_q$
space.

A fundamental problem in Learning Theory is to identify the features of
the underlying class $F$ and of the set of admissible targets $\mathcal{Y}$ that govern $\mathcal{E}_p$;
in particular, the way $\mathcal{E}_p$ scales with the sample size $N$ (the so-called error
rate). This question has been studied extensively, and we refer the reader to
the manuscripts [SUgEllITlElEKinKn] for more information on its history 
and on some more recent progress. 

Here, the aim is to obtain matching upper and lower bounds on $\mathcal{E}_p$ that
hold for any reasonable class $F$, at least under some assumptions which we
will now outline.

It is well understood that the ability to predict is quantified by various
complexity parameters of the underlying class. Frequently, one encounters
parameters that are based on various gaussian and empirical/multiplier
processes indexed by ‘localizations’ of $F$ (see, e.g., m), and any hope of
obtaining matching bounds on $\mathcal{E}_p$ must be based on sharp estimates on these
processes. Unfortunately, the analysis of empirical/multiplier processes is,
in general, highly nontrivial. Moreover, and unlike gaussian processes, there
is no clear path that leads to sharp bounds on empirical processes, and
even when upper estimates are available, they are often loose and lead to
suboptimal bounds on $\mathcal{E}_p$.

The one generic example in which a more satisfactory theory of
empirical/multiplier processes is known is when the indexing class is $L$-subgaussian.

Definition 1.2 A class $F \subset L_2(\mu)$ is $L$-subgaussian with respect to the
measure $\mu$ if for every $p \ge 2$ and every $f, h \in F \cup \{0\}$,

$$\|f - h\|_{L_p(\mu)} \le L \sqrt{p} \, \|f - h\|_{L_2(\mu)},$$

and if the canonical gaussian process $\{G_f : f \in F\}$ is bounded (see the book
^ for a detailed survey on gaussian processes). 

More facts on subgaussian classes may be found in [g dsi n da E]. For 
our purposes, the main feature of subgaussian classes is that the empirical 
and multiplier processes that govern $\mathcal{E}_p$ may be bounded from above using
properties of the canonical gaussian process indexed by the class, giving one 
some hope of obtaining sharp estimates. Because of that feature, we will 
focus in what follows on subgaussian classes. 

Despite their importance, complexity parameters are not the entire story
when it comes to $\mathcal{E}_p$. For example, it is possible to construct a class consisting
of just two functions, $\{f_1, f_2\}$, but if the target $Y$ is a $1/\sqrt{N}$-perturbation of
the midpoint $(f_1 + f_2)/2$, no learning procedure can perform with an error
that is better than $c/\sqrt{N}$ having been given a sample of cardinality $N$ (see,
e.g., [1]). Thus, rather than being solely determined by the complexity of
the underlying class, there is an additional geometric requirement on $F$ and
$\mathcal{Y}$ which is there to ensure that all the admissible targets in $\mathcal{Y}$ are located
in a favourable position relative to $F$ (see m for more details). One may
show that if $F \subset L_2(\mu)$ is compact and convex, any target $Y \in L_2$ is in a
favourable position relative to $F$. Therefore, to remove possible geometric
obstructions, we will assume that $F \subset L_2(\mu)$ is compact and convex.

Finally, for a reason that will become clear later, we will not study a
general class of admissible targets $\mathcal{Y}$, but rather consider targets of the form
$Y = f(X) + W$ for some $f \in F$ and $W$ that is orthogonal to $\mathrm{span}(F)$ (e.g.,
a mean-zero random variable $W \in L_2$ that is independent of $X$ is a
‘legal’ choice).

With all these assumptions in place, let us formulate the question we 
would like to study: 

Question 1.3 Let $F \subset L_2(\mu)$ be a compact, convex class that is $L$-subgaussian
with respect to $\mu$. Given targets of the form $Y = f(X) + W$ as above, find
matching upper and lower bounds (up to constants) on $\mathcal{E}_p$.

Let us recall the following standard definitions.

Definition 1.4 Let $F \subset L_2(\mu)$. Set

$$F - h = \{f - h : f \in F\} \quad \text{and} \quad F - F = \{f - h : f, h \in F\}.$$

Denote by

$$\mathrm{star}(F) = \{\lambda f : f \in F, \ 0 \le \lambda \le 1\}$$

the star-shaped hull of $F$ with 0; $F$ is star-shaped around 0 if $\mathrm{star}(F) = F$.
Let $\{G_f : f \in F\}$ be the canonical gaussian process indexed by $F$ and set

$$\mathbb{E}\|G\|_F = \sup\Big\{ \mathbb{E} \sup_{f \in F'} G_f : F' \subset F, \ F' \text{ is finite} \Big\}.$$

Finally, let $D$ be the unit ball in $L_2(\mu)$.

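The best known bounds on $\mathcal{E}_p$ in the subgaussian context have been
established in [7] and are based on two fixed points:

Definition 1.5 For $\kappa_1, \kappa_2 > 0$, set

$$r_M(\kappa_1, f) = \inf\Big\{ s > 0 : \mathbb{E}\|G\|_{(F - f) \cap sD} \le \kappa_1 s^2 \sqrt{N} \Big\} \qquad (1.1)$$

and

$$r_Q(\kappa_2, f) = \inf\Big\{ s > 0 : \mathbb{E}\|G\|_{(F - f) \cap sD} \le \kappa_2 s \sqrt{N} \Big\}. \qquad (1.2)$$

Put

$$r_M(\kappa_1) = \sup_{f \in F} r_M(\kappa_1, f) \quad \text{and} \quad r_Q(\kappa_2) = \sup_{f \in F} r_Q(\kappa_2, f).$$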

In the context of the problem we are interested in, one has the following:

Theorem 1.6 For every $L \ge 1$ there exist constants $c_1$, $c_2$, $c_3$ and $c_4$
that depend only on $L$ for which the following holds. Let $F \subset L_2(\mu)$ be a
compact, convex, $L$-subgaussian class of functions, set $Y = f_0(X) + W$ and
assume that for every $p \ge 2$, $\|W\|_{L_p} \le L\sqrt{p}\,\|W\|_{L_2}$. There is a learning
procedure (empirical risk minimization performed in $F$) for which, if

$$r \ge 2 \max\{ r_M(c_0/\|W\|_{L_2}), \ r_Q(c_1) \},$$

then with probability at least

$$1 - 2\exp\big( -c_2 N \min\{1, r^2/\|W\|_{L_2}^2\} \big),$$

the error of the procedure is at most $\mathcal{E}_p \le r^2$.

The lower bound that complements Theorem 1.6 uses ‘local’ analogs of
$r_M$ and $r_Q$ that are based on the notion of packing numbers.

Definition 1.7 Let $E$ be a normed space and set $B$ to be its unit ball. Let
$\mathcal{M}(A, rB)$ be the cardinality of a maximal $r$-separated subset of $A$ with
respect to the given norm, that is, the cardinality of the largest subset
$(a_i) \subset A$ for which $\|a_i - a_j\| \ge r$ for every $i \ne j$.
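
To make Definition 1.7 concrete, here is a minimal Python sketch that builds an $r$-separated subset of a finite point set in Euclidean space by a single greedy pass. The function name `greedy_separated_subset` and the finite Euclidean setting are illustrative assumptions and do not appear in the article. The greedy set is maximal with respect to inclusion, so its cardinality is only a lower bound on the packing number, which is defined through the largest such subset.

```python
import math

def greedy_separated_subset(points, r):
    """Greedily select points that are pairwise at distance >= r.

    The result is maximal with respect to inclusion: no further point of
    `points` can be added without violating r-separation. Its size is a
    lower bound on the packing number, not necessarily the packing number.
    """
    chosen = []
    for p in points:
        # keep p only if it is at distance >= r from every chosen point
        if all(math.dist(p, q) >= r for q in chosen):
            chosen.append(p)
    return chosen

# Hypothetical example: five points on the real line, separation r = 1.
A = [(0.0,), (0.4,), (1.0,), (1.5,), (2.0,)]
subset = greedy_separated_subset(A, 1.0)
print(subset)
```

The order in which points are visited affects which subset is returned; any output of this routine is nevertheless $r$-separated and cannot be enlarged within the given point set.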

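Definition 1.8 For $\eta_1, \eta_2 > 0$ set

$$\gamma_M(\eta_1, f) = \inf\big\{ s > 0 : \log \mathcal{M}\big( (F - f) \cap 4sD, (s/2)D \big) \le \eta_1^2 s^2 N \big\}$$

and

$$\gamma_Q(\eta_2, f) = \inf\big\{ s > 0 : \log \mathcal{M}\big( (F - f) \cap 4sD, (s/2)D \big) \le \eta_2^2 N \big\}.$$

Put

$$\gamma_M(\eta_1) = \sup_{f \in F} \gamma_M(\eta_1, f) \quad \text{and} \quad \gamma_Q(\eta_2) = \sup_{f \in F} \gamma_Q(\eta_2, f).$$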

Theorem 1.9 There exist absolute constants $c_1$ and $c_2$ for which the
following holds. Let $F$ be a class of functions, let $W$ be a centred normal
random variable and for every $f \in F$ put $Y^f = f(X) + W$. If $\Psi$ is a learning
procedure that performs for every target $Y^f$ with confidence at least 3/4, then
there is some $Y^f$ for which $\mathcal{E}_p \ge c_1 \gamma_M^2(c_2/\|W\|_{L_2})$.

Remark 1.10 One should note that a lower bound that is based on $\gamma_Q$ was
not known.

The connection between the two types of parameters is Sudakov’s
inequality (see, e.g. 0): there is an absolute constant $c$ for which, for every
$H \subset L_2(\mu)$,

$$c \sup_{\varepsilon > 0} \varepsilon \log^{1/2} \mathcal{M}(H, \varepsilon D) \le \mathbb{E}\|G\|_H.$$

To see the connection, assume that for every $f \in F$, $\mathbb{E}\|G\|_{(F-f) \cap 4rD} \le
\kappa_1 (4r)^2 \sqrt{N}$, which means that $r_M(\kappa_1) \le 4r$. Applying Sudakov’s inequality
to $H = (F - f) \cap 4rD$ and for the choice of $\varepsilon = r/2$,

$$c (r/2) \log^{1/2} \mathcal{M}\big( (F - f) \cap 4rD, (r/2)D \big) \le \mathbb{E}\|G\|_{(F-f) \cap 4rD} \le \kappa_1 (4r)^2 \sqrt{N};$$

hence, $\gamma_M(c_1 \kappa_1) \le r$. A similar observation is true for $r_Q$ and $\gamma_Q$, which
shows that $\gamma_M$ and $\gamma_Q$ are intrinsically smaller than $r_M$ and $r_Q$ respectively,
for the right choice of constants.

The starting point of this article is the fact that the gap between these upper
and lower estimates on $\mathcal{E}_p$ is more than a mere technicality.

The core issue is that the parameters $r_M$ and $r_Q$ are ‘global’ in nature,
whereas $\gamma_M$ and $\gamma_Q$ are ‘local’. Indeed, although $(F - f) \cap rD$ is a localized
set, $\mathbb{E}\|G\|_{(F-f) \cap rD}$ is not determined solely by the effects of a ‘level’ that is
proportional to $r$. For example, it is straightforward to construct examples in
which $\mathbb{E}\|G\|_{(F-f) \cap rD} \ge c r \sqrt{N}$ because of a very large, $\rho$-separated subset
of $(F - f) \cap rD$, for $\rho$ that is much smaller than $r$. Thus, even if $r_M$ or
$r_Q$ are of order $r$, this need not be ‘exhibited’ by $(F - f) \cap rD$ at a scale
that is proportional to $r$. In contrast, $\gamma_M$ and $\gamma_Q$ are ‘local’: the degree of
separation is proportional to the diameter of the separated set, and the fixed
point indicates that $(F - f) \cap rD$ is truly ‘rich’ at a scale that is proportional
to $r$.

As noted in [7], the upper and lower estimates coincide when the ‘local’
and ‘global’ parameters are equivalent, but that is not a typical situation -
in the generic case, there is a gap between the two. An example of that fact
will be presented in Section 5.

Given that there is a gap between the two sets of parameters, one must
face the obvious question: which of the two captures $\mathcal{E}_p$? Is it the ‘global’
pair, $r_Q$ and $r_M$, or the ‘local’ one of $\gamma_Q$ and $\gamma_M$?

Our main result is that the ‘local’ parameters are the right answer - at
least in the setup outlined above. To that end, we shall improve the upper
bound in Theorem 1.6 and add the missing component in Theorem 1.9.


Theorem 1.11 For every $L \ge 1$ and $q > 2$ there are constants $c_0, ..., c_5$ that
depend only on $q$ and $L$ for which the following holds. Let $F \subset L_2(\mu)$ be a
compact, convex, $L$-subgaussian class of functions with respect to $\mu$. There
is a learning procedure $\Psi : (\Omega \times \mathbb{R})^N \to F$, for which, if $Y = f(X) + W$ for
$f \in F$ and $W \in L_q$ that is orthogonal to $\mathrm{span}(F)$, then with probability at
least

$$1 - c_2 \frac{\log^q N}{N^{(q/2)-1}} - 2\exp\big( -c_0 N \min\{1, \gamma_M^2(c_1/\|W\|_{L_q})\} \big),$$

$$\mathcal{E}_p \le c_3 \max\big\{ \gamma_M^2(c_1/\|W\|_{L_q}), \ \gamma_Q^2(c_4) \big\} + r_Q^2(c_4) \exp(-c_5 \exp(N)).$$

The term $r_Q^2(c_4) \exp(-c_5 \exp(N))$ is almost certainly an artifact of the
proof, but in any case, it is significantly smaller than the dominating term
in any reasonable example.

To complement Theorem 1.11 we obtain the following lower bound.

Theorem 1.12 There exist absolute constants $c_0$ and $c_1$ for which the
following holds. Let $F \subset L_2(\mu)$ be a convex, centrally-symmetric class of
functions and let $\Psi$ be any learning procedure that performs with confidence 7/8
for any target of the form $Y = f(X) + W$ for some $f \in F$ and $W \in L_2$ that
is orthogonal to $\mathrm{span}(F)$.

• For any $W \in L_2$ that is orthogonal to $\mathrm{span}(F)$, there is some $f \in F$, for
which, for $Y = f(X) + W$,

$$\mathcal{E}_p \ge c_0 \gamma_Q^2(c_1).$$

• If $W$ is a centred, normal random variable that is independent of $X$, there
is some $f \in F$ for which, for $Y = f(X) + W$,

$$\mathcal{E}_p \ge c_0 \gamma_M^2(c_1/\|W\|_{L_2}).$$

An outcome of Theorem 1.11 and Theorem 1.12 is that if $W$ is a centred
gaussian random variable that is independent of $X$, then for any convex,
centrally-symmetric, $L$-subgaussian class $F$, the upper and lower estimates
match (up to the parasitic and negligible term $r_Q^2(c_4) \exp(-c_5 \exp(N))$ in
the upper bound): when considering targets of the form $Y = f(X) + W$ for
$f \in F$,

$$\mathcal{E}_p \sim \max\big\{ \gamma_Q^2(c_1), \ \gamma_M^2(c_2/\|W\|_{L_2}) \big\}.$$

The second part of Theorem 1.12 follows from Theorem 1.9. We have
chosen to present a new proof of that fact - a proof we believe is both
instructive and less restrictive than existing proofs. The first part of Theorem
1.12 is, to the best of our knowledge, new.

Let us mention that if $F$ happens to be convex and centrally symmetric
(i.e. if $f \in F$ then $-f \in F$), what is essentially the ‘richest’ shift of $F$ is the
0-shift. Indeed, since $F - F = 2F$, it is evident that for every $f \in F$,

$$(F - f) \cap 4rD \subset (F - F) \cap 4rD = 2(F \cap 2rD).$$

This makes one’s life much simpler when studying lower bounds, as it gives
an obvious choice of where to look. Indeed, the ‘richest’ part of $F$ is the
hardest part for a learning procedure to deal with - and that part is a
neighbourhood of 0.

1.1 The idea of the proof of the upper bound

The proof of the upper bound is based on the following decomposition
of the squared excess loss: let $Y$ be the unknown target and set $f^* =
\mathrm{argmin}_{f \in F} \|f - Y\|_{L_2}$. For every $f \in F$, let $\ell_f(X, Y) = (f(X) - Y)^2$ and set

$$\mathcal{L}_f(X, Y) = (\ell_f - \ell_{f^*})(X, Y) = (f(X) - Y)^2 - (f^*(X) - Y)^2
= 2(f^*(X) - Y)(f - f^*)(X) + (f - f^*)^2(X). \qquad (1.3)$$

Let $P_N h = \frac{1}{N} \sum_{i=1}^N h(X_i, Y_i)$ and set

$$\hat{f} = \mathrm{argmin}_{f \in F} P_N \ell_f = \mathrm{argmin}_{f \in F} P_N \mathcal{L}_f$$

to be the empirical minimizer in $F$. The learning procedure that assigns
to every sample $(X_i, Y_i)_{i=1}^N$ the empirical minimizer in $F$ is called Empirical
Risk Minimization (ERM).

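Clearly, $\mathcal{L}_{f^*} = 0$, and thus, for every sample $(X_i, Y_i)_{i=1}^N$,

$$P_N \mathcal{L}_{\hat{f}} \le 0,$$

implying that members of the random set $\{f \in F : P_N \mathcal{L}_f > 0\}$ cannot be
empirical minimizers. One way of identifying that set is via the
decomposition (1.3): assume that $(X_i, Y_i)_{i=1}^N$ is a sample for which, if $\|f - f^*\|_{L_2} \ge r$,
one has

$$\frac{1}{N} \sum_{i=1}^N (f - f^*)^2(X_i) \ge \kappa \|f - f^*\|_{L_2}^2 \qquad (1.4)$$

and

$$\Big| \frac{1}{N} \sum_{i=1}^N (f^*(X_i) - Y_i)(f - f^*)(X_i) - \mathbb{E}(f^*(X) - Y)(f - f^*)(X) \Big| \le \frac{\kappa}{4} \|f - f^*\|_{L_2}^2. \qquad (1.5)$$

Since $F$ is compact and convex, by properties of the metric projection onto
a closed convex set in an inner product space,

$$\mathbb{E}(f^*(X) - Y)(f - f^*)(X) \ge 0 \qquad (1.6)$$

for every $f \in F$. Therefore, setting $\xi = f^*(X) - Y$ and $\xi_i = f^*(X_i) - Y_i$,

$$P_N \mathcal{L}_f = \frac{1}{N} \sum_{i=1}^N (f - f^*)^2(X_i) + 2\Big( \frac{1}{N} \sum_{i=1}^N \xi_i (f - f^*)(X_i) - \mathbb{E}\xi (f - f^*)(X) \Big) + 2\mathbb{E}\xi (f - f^*)(X)
\ge \big( \kappa - 2(\kappa/4) \big) \|f - f^*\|_{L_2}^2 > 0$$

for every $f \in F$ that satisfies $\|f - f^*\|_{L_2} \ge r$. Thus, if (1.4) and (1.5) hold
for the sample $(X_i, Y_i)_{i=1}^N$, then

$$\{f \in F : \|f - f^*\|_{L_2} \ge r\} \subset \{f \in F : P_N \mathcal{L}_f > 0\},$$

implying that $\|\hat{f} - f^*\|_{L_2} < r$.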

This argument has been used in [10] and was then extended in
showing that 

- which is the type of result one is looking for. 

This method of proof leads to the complexity parameters $r_Q$ and $r_M$: the
former controls the quadratic component (1.4) and the latter the multiplier
component (1.5). The ‘global’ nature of $r_Q$ and $r_M$, i.e., the fact that the
two depend on the gaussian oscillation $\mathbb{E}\|G\|_{(F - f^*) \cap rD}$, cannot be helped: the
oscillations of the quadratic and multiplier processes are highly affected by
the ‘richness’ of $F$ around $f^*$ at every ‘level’.

A rather obvious idea for improving the upper estimate is ‘erasing’ all
the fine structure of $F$, for example, by replacing $F$ with an appropriate
separated subset. The difficulty in such an approach is that the geometry
of a separated set is problematic, and (1.6) will no longer be true for an
arbitrary target $Y$. This is why we only consider targets of the form $f(X) +
W$ for $f \in F$ and $W$ that is orthogonal to $\mathrm{span}(F)$. For such targets, a
version of (1.6) happens to be true even if $F$ is replaced by a separated set.

The path we will take in proving the upper bound is as follows:

• Choose a ‘correct’ level $r$ using the parameters $\gamma_M$ and $\gamma_Q$ for well-chosen
constants $\eta_1$ and $\eta_2$ that depend only on $q$ and $L$.

• Replace $F$ by $V$, a maximal $r$-separated subset of $F$ with respect to the
$L_2(\mu)$ norm, and study ERM in $V$. To that end, set $v_0 = \mathrm{argmin}_{v \in V} \|v - Y\|_{L_2}$
and observe that by the orthogonality of $W$ to $\mathrm{span}(F)$, for every
$v \in V$,

$$|\mathbb{E}(v_0(X) - Y)(v - v_0)(X)| = |\mathbb{E}(v_0 - f^*)(v - v_0)(X)| \le r \|v - v_0\|_{L_2}.$$

Therefore, the empirical excess loss relative to $V$ satisfies

$$P_N \mathcal{L}_v^V \ge \frac{1}{N} \sum_{i=1}^N (v - v_0)^2(X_i)
- 2\Big| \frac{1}{N} \sum_{i=1}^N (v_0(X_i) - Y_i)(v - v_0)(X_i) - \mathbb{E}(v_0(X) - Y)(v - v_0)(X) \Big|
- 2r \|v - v_0\|_{L_2}.$$

• Next, one may study the corresponding quadratic and multiplier processes
indexed by localizations of $V$ and show that with high probability,
if $\|v - v_0\|_{L_2} \ge c_1 r$ then $P_N \mathcal{L}_v^V > 0$. Thus, ERM performed in $V$
produces $\hat{v}$ for which $\|\hat{v} - v_0\|_{L_2} \le c_1 r$.

• It is possible to show that on the same event, $\|\hat{v} - f^*\|_{L_2} \le c_2 r$. And, using
the orthogonality of $W$ to $\mathrm{span}(F)$ once again, $\mathbb{E}\big( \mathcal{L}_{\hat{v}} \,|\, (X_i, Y_i)_{i=1}^N \big) \le
c_3 r^2$, as required.
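
The second step of the path above, running ERM over a finite subset rather than over the whole class, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the finite class of linear functions, the slope grid `slopes`, the noise-free sample, and the helper names `empirical_risk` and `erm_index` are all hypothetical and are not taken from the article.

```python
def empirical_risk(f, sample):
    """Average squared loss of the function f on a sample of (x, y) pairs."""
    return sum((f(x) - y) ** 2 for x, y in sample) / len(sample)

def erm_index(candidates, sample):
    """Index of a candidate with the smallest empirical squared loss."""
    risks = [empirical_risk(f, sample) for f in candidates]
    return risks.index(min(risks))

# Hypothetical finite class V: linear functions x -> t * x on a grid of slopes.
slopes = [-1.0, -0.5, 0.0, 0.5, 1.0]
V = [lambda x, t=t: t * x for t in slopes]

# Hypothetical noise-free sample generated by the target y = 0.5 * x.
sample = [(1.0, 0.5), (2.0, 1.0), (-1.0, -0.5)]
best = erm_index(V, sample)
print(slopes[best])
```

Since the search runs over finitely many candidates, the minimizer is found by direct enumeration; the analysis in the article concerns how close such a minimizer is to $v_0$ and to $f^*$, which this sketch does not attempt to verify.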


2 Preliminaries 

Let us begin with some notation. Throughout, absolute constants are
denoted by $c, c_1, ...$ etc. Their value may change from line to line. $c(a)$ is a
constant that depends only on the parameter $a$. We use $\kappa_1, \kappa_2, \eta_1, \eta_2$ etc.
to denote fixed constants whose value remains unchanged throughout the
article.

In what follows, we will, at times, abuse notation and not specify the
probability space on which each random variable is defined. For example,
$\|f - Y\|_{L_2}^2 = \mathbb{E}(f(X) - Y)^2$ and integration is with respect to the joint
distribution of $X$ and $Y$, while $\|f - f_0\|_{L_2}^2 = \mathbb{E}(f - f_0)^2(X)$, in which case
integration is with respect to $\mu$.

Next, let us turn to the notions of cover and covering numbers. 

Definition 2.1 Let $B$ be a unit ball of a norm. Set $\mathcal{N}(A, B)$ to be the
minimal number of centres $a_1, ..., a_n \in A$ for which $A \subset \bigcup_{i=1}^n (a_i + B)$.
The set $\{a_1, ..., a_n\}$ is called a cover of $A$ with respect to $B$. An $r$-cover is a cover with
respect to the set $rB$.

It is standard to verify that if $a_1, ..., a_m$ is a maximal separated subset with
respect to $B$ then it is also a cover with respect to $B$. Indeed, the maximality
of the separated set implies that every point $a \in A$ has some $a_i$ for which
$\|a_i - a\| < 1$, i.e., $a \in a_i + B$. Therefore, $\mathcal{N}(A, B) \le \mathcal{M}(A, B)$. In the
reverse direction, if $a_1, ..., a_n$ is a cover with respect to $B$, then each one of
the balls $a_i + B$ contains at most one point in any 2-separated set. Thus,
$\mathcal{M}(A, 2B) \le \mathcal{N}(A, B)$.
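
The two comparisons between packing and covering numbers can be checked numerically on a finite point cloud. The sketch below is an illustration only: the random planar point set, the radius `r`, and the helper names `greedy_separated` and `is_cover` are assumptions made for the example. It uses a greedily built (hence inclusion-maximal) separated set as the centres of a cover, with strict inequality in the covering test.

```python
import math
import random

def greedy_separated(points, r):
    """Inclusion-maximal subset of `points` with pairwise distances >= r."""
    chosen = []
    for p in points:
        if all(math.dist(p, q) >= r for q in chosen):
            chosen.append(p)
    return chosen

def is_cover(points, centres, r):
    """True if every point lies strictly within distance r of some centre."""
    return all(any(math.dist(p, c) < r for c in centres) for p in points)

random.seed(0)  # fixed seed so that the illustration is reproducible
A = [(random.random(), random.random()) for _ in range(200)]
r = 0.2

centres = greedy_separated(A, r)   # maximal r-separated subset of A
wide = greedy_separated(A, 2 * r)  # some 2r-separated subset of A

# A maximal r-separated set is an r-cover of A.
print(is_cover(A, centres, r))
# Each open r-ball around a centre holds at most one point of a 2r-separated set.
print(len(wide) <= len(centres))
```

Both checks hold for any finite point set and any $r > 0$, by the same argument as in the text: a point left out of the greedy set must be closer than $r$ to a chosen one, and two points in the same open ball of radius $r$ are closer than $2r$ to each other.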

The following lemma is straightforward but it plays a crucial part in
what follows.

Lemma 2.2 Let $T \subset W \subset L_2(\mu)$. For $s > r > 0$, set

$$\phi(s, r) = \sup_{w \in W} \mathcal{N}(T \cap (w + sD), rD).$$

Then

1. $\phi(s, r) \le \phi(s, s/2) \cdot \phi(s/2, r)$.

2. If $T$ and $W$ are star-shaped around 0 then

$$\log \phi(s, r) \le c_0 \log(2s/r) \cdot \log \phi(4r, r)$$

for a suitable absolute constant $c_0$.

Proof. Fix $w \in W$ and let $t_1, ..., t_N \in T \cap (w + sD)$ be centres of a minimal
$s/2$-cover of that set. For every $1 \le i \le N$,

$$T \cap (w + sD) \cap (t_i + (s/2)D) \subset T \cap (t_i + (s/2)D),$$

and $\mathcal{N}(T \cap (t_i + (s/2)D), rD) \le \phi(s/2, r)$, because $t_i \in T \subset W$. Therefore,

$$\sup_{w \in W} \mathcal{N}(T \cap (w + sD), rD) \le \sup_{w \in W} \mathcal{N}(T \cap (w + sD), (s/2)D) \cdot \phi(s/2, r).$$

Turning to the second part of the claim, assume that $T$ and $W$ are
star-shaped around 0. Let $w \in W$, set $t_1, ..., t_m$ to be a maximal $s/2$-separated
subset of $T \cap (w + sD)$ with respect to the $L_2(\mu)$ norm and put $y_i = (r/s) t_i$.
Since $T$ is star-shaped around 0, $y_i \in T$ and $(y_i)_{i=1}^m$ is an $r/2$-separated
subset of $(r/s)w + rD$. For the same reason, $(r/s)w \in W$, and

$$\mathcal{M}(T \cap (w + sD), (s/2)D) \le \sup_{v \in W} \mathcal{M}(T \cap (v + rD), (r/2)D).$$

Using the standard connection between packing numbers and covering
numbers and taking the supremum over $w$,

$$\phi(s, s/2) = \sup_{w \in W} \mathcal{N}(T \cap (w + sD), (s/2)D) \le \sup_{w \in W} \mathcal{M}(T \cap (w + sD), (s/2)D)
\le \sup_{w \in W} \mathcal{M}(T \cap (w + 2rD), rD).$$

Iterating the first part of the lemma,

$$\log \phi(s, r) \le \log_2(2s/r) \cdot \sup_{w \in W} \log \mathcal{N}(T \cap (w + 4rD), 2rD)
\le \log_2(2s/r) \cdot \sup_{w \in W} \log \mathcal{N}(T \cap (w + 4rD), rD)
\le \log_2(2s/r) \cdot \log \phi(4r, r),$$

as claimed. ■

Before we turn to the proof of the upper bound, let us revisit the
complexity parameters in question. Since $F$ is a convex class, $F - f$ is star-shaped
around 0; hence, if $s > r$,

$$\mathcal{M}\big( (F - f) \cap 4sD, (s/2)D \big) \le \mathcal{M}\big( (F - f) \cap 4rD, (r/2)D \big).$$

In particular, if $\gamma_M(\eta_1, f) < r$ then

$$\log \mathcal{M}\big( (F - f) \cap 4sD, (s/2)D \big) \le \eta_1^2 N r^2 \le \eta_1^2 N s^2,$$

implying that $\gamma_M(\eta_1, f) \le s$ as well.

This simple argument shows that if $r < \gamma_M(\eta_1, f)$ then

$$\log \mathcal{M}\big( (F - f) \cap 4rD, (r/2)D \big) > \eta_1^2 N r^2,$$

while if $r > \gamma_M(\eta_1, f)$, the reverse inequality holds.

A similar assertion holds for $\gamma_Q$, $r_M$ and $r_Q$; the rather standard proof of
these facts, which is almost identical to the argument used above, is omitted.

3 The upper bound 

Let $F \subset L_2(\mu)$ be a compact, convex class of functions. Fix $r > 0$ that
will be named later and let $V$ be a maximal $r$-separated subset of $F$.
Note that for every $v_0 \in V$, $F_{v_0} = F - v_0$ is star-shaped around 0, and
$\mathrm{star}(V - v_0) \subset F - v_0$. Using the notation of Lemma 2.2, let $T = W = F_{v_0}$,
and for $s > 2r > 0$,

$$\begin{aligned}
\log \mathcal{N}\big( (\mathrm{star}(V - v_0)) \cap sD, rD \big) &\le \log \mathcal{N}(F_{v_0} \cap sD, rD) \\
&\le \sup_{x \in F} \log \mathcal{N}\big( F_{v_0} \cap (x - v_0 + sD), rD \big) \\
&\le c_0 \log(s/r) \sup_{x \in F} \log \mathcal{N}\big( F_{v_0} \cap (x - v_0 + 4rD), rD \big) \\
&= c_0 \log(s/r) \sup_{x \in F} \log \mathcal{N}\big( F \cap (x + 4rD), rD \big).
\end{aligned}$$

Also, observe that $F \cap (x + 4rD) \subset ((F - x) \cap 4rD) + x$, implying that

$$\log \mathcal{M}\big( (\mathrm{star}(V - v_0)) \cap sD, rD \big) \le c_0 \log(s/r) \cdot \sup_{x \in F} \log \mathcal{M}\big( (F - x) \cap 4rD, rD \big). \qquad (3.1)$$

Moreover, the same estimate holds for $(V - v_0) \cap sD$, and since $V - v_0$ is
$r$-separated,

$$\begin{aligned}
\log |(V - v_0) \cap sD| = \log \mathcal{M}\big( (V - v_0) \cap sD, rD \big) &\le \log \mathcal{M}\big( (V - v_0) \cap sD, (r/2)D \big) \\
&\le \log \mathcal{M}\big( F_{v_0} \cap sD, (r/2)D \big) \\
&\le c_0 \log(s/r) \cdot \sup_{x \in F} \log \mathcal{M}\big( (F - x) \cap 4rD, (r/2)D \big). \qquad (3.2)
\end{aligned}$$

xGF 

With that in mind, fix constants $\eta_1, \eta_2, \kappa_2$ and $\kappa_3$ that will be specified
later, and for that choice of constants, let $r > 0$ for which

$$\sup_{x \in F} \log \mathcal{M}\big( (F - x) \cap 4rD, (r/2)D \big) \le \max\{ \eta_1^2 N r^2, \ \eta_2^2 N \}, \qquad (3.3)$$

and

$$r \ge r_Q(\kappa_2) \exp(-\kappa_3 \exp(N));$$

that is,

$$r \ge \max\{ \gamma_M(\eta_1), \ \gamma_Q(\eta_2), \ r_Q(\kappa_2) \exp(-\kappa_3 \exp(N)) \}.$$

Let $V$ be a maximal $r$-separated subset of $F$ with respect to the $L_2(\mu)$
norm. Following the path outlined earlier, the idea is to study ERM in
$V$, given the data $(X_i, Y_i)_{i=1}^N$ for $Y = f_0(X) + W$. To that end, one must
control the multiplier and quadratic components in the decomposition of the
squared loss relative to $V$: if $v_0 = \mathrm{argmin}_{v \in V} \|v(X) - Y\|_{L_2}$,

$$\mathcal{L}_v^V(X, Y) = (v(X) - Y)^2 - (v_0(X) - Y)^2
= 2(v_0(X) - Y)(v - v_0)(X) + (v - v_0)^2(X).$$

Let us begin with the multiplier component:

Lemma 3.1 Fix $0 < \theta < 1$, $L \ge 1$ and $q > 2$. There exist constants $c_0$,
$c_1$ and $c_2$ that depend only on $L$ and $q$ and for which the following holds.
Let $F$ be a convex, $L$-subgaussian class, set $\xi \in L_q$ for some $q > 2$ and put
$\eta_1 = c_0 \theta / \|\xi\|_{L_q}$. Then, for every $v_0 \in V$, with probability at least

$$1 - c_1 \frac{\log^q N}{N^{(q/2)-1}} - 2\exp(-c_2 \eta_1^2 r^2 N),$$

$$\sup_{\{v \in V : \|v - v_0\|_{L_2} \ge r\}} \Big| \frac{1}{N} \sum_{i=1}^N \xi_i (v - v_0)(X_i) - \mathbb{E}\xi (v - v_0)(X) \Big| \Big/ \|v - v_0\|_{L_2}^2 \le \theta.$$

The proof of Lemma 3.1 is based on the following fact from [El-

Theorem 3.2 For $L \ge 1$ and $q > 2$ there exist constants $c_0$, $c_1$ and $c_2$ that
depend only on $L$ and $q$ for which the following holds. Let $\xi \in L_q$, set $H$ to
be an $L$-subgaussian class and denote by $d_H = \sup_{h \in H} \|h\|_{L_2}$. If $w, u > 8$,

with probability at least 




Proof of Lemma 3.1 The proof consists of two parts: first, controlling
the process indexed by $\{f \in F : \|f - v_0\|_{L_2} \ge s\}$ where $s = (3/2) r_M(\eta_1, v_0)$,
and then treating the process indexed by $\{v \in V : r \le \|v - v_0\|_{L_2} \le s\}$.
Clearly, without loss of generality one may assume that $r \le s$.

By the regularity of $r_M$ and since $s > r_M(\eta_1, v_0)$,

$$\mathbb{E}\|G\|_{(F - v_0) \cap sD} \le \eta_1 \sqrt{N} s^2.$$

Moreover, $(F - v_0) \cap (s/4)D \subset (F - v_0) \cap sD$, and since $s/4 < r_M(\eta_1, v_0)$,
the regularity of $r_M$ implies that

$$\mathbb{E}\|G\|_{(F - v_0) \cap sD} \ge \eta_1 \sqrt{N} s^2 / 16.$$

Therefore, applying Theorem 3.2 to the set $F' = (F - v_0) \cap sD$, there
are constants $c_1$, $c_2$ and $c_3$ that depend only on $q$ and $L$ for which, with
probability at least

$$1 - c_1 \frac{\log^q N}{N^{(q/2)-1}} - 2\exp(-c_2 \eta_1^2 s^2 N),$$

if $f \in F$ and $\|f - v_0\|_{L_2} \le s$,

$$\Big| \frac{1}{N} \sum_{i=1}^N \xi_i (f - v_0)(X_i) - \mathbb{E}\xi (f - v_0)(X) \Big| \le c_3 L \|\xi\|_{L_q} \eta_1 s^2 = (*).$$

Clearly, $(*) \le \theta s^2$ if $\eta_1 \le \theta / (c_3 L \|\xi\|_{L_q})$, and for such a choice, if $\|f - v_0\|_{L_2} = s$
then

$$\Big| \frac{1}{N} \sum_{i=1}^N \xi_i (f - v_0)(X_i) - \mathbb{E}\xi (f - v_0)(X) \Big| \le \theta \|f - v_0\|_{L_2}^2; \qquad (3.4)$$

since $F - v_0$ is star-shaped around 0, (3.4) holds on the same event for every
$f \in F$ for which $\|f - v_0\|_{L_2} \ge s$.

Next, one has to control the process indexed by {u G C : r < ||u — 
vo\\l 2 < •s}- Set jo = \s/r\, fix Sj = 2^r for 0 < j < jo and let Vj = 
star((l/ — vq) n SjD). By Theorem 13.21 on an event Aj, for every h G Vj, 


(3-4) 


1 ^ 


2=1 


E||G||y. 

< CA{L,q)wjUjU\\L, = {**)j. 


The aim it to ensure that (**)j < 9s‘jlA and that Aj is of high enough 
probability. Indeed, on Aj, \i v and Sj/2 < Hri — uo||l 2 ^ 




< 9\\v 


l|2 


To that end, let Wj = ^/J, recall that dy = sup„g^/ ||^^||l 2 thus dvj = 
Sn = r2F Put 


y/N9 2^r dy 


’ 4c4||^||l„ Vj 1E||G||v. 


and consider two cases: first, if Uj > 8 then clearly, (*) < 9s‘jjA and 


Pr{Aj) > 1 — C5- 


log'^ N 


22i02 


j'?/2 Niq/2) 
Alternatively, if Uj = 8, then 

o f nGWv, 
n dy 


--2e.,l-c,iq,L)N .r . 


> cj{q, L)r^N-^ 


2‘ijg‘2 


MW 


L 2 


Also, by (j3.2p . Vj has at most |(C — vq) n SjD\ extreme points. Since 
log|(P - uo) n < C8log(sV7’)logA4 (F^o n4rF, (r/2)F) 

< C8log(sV^)^iV^F 

















by standard properties of gaussian processes 


EllGlly. <cgdvj ■ log ^/^\{V - Vo) n Sj-Dj < ciosjlog^'^^ (^) 

=cior]i\/N<^s'j. 

Hence, there are constants cn and C 12 that depend only on q and L for 
which 


sup 

h^Vj 


N 


N 




i=l 


< Cll—^ 

VJ 



2 ^.s- < OspA 


if r/i < ci20/U\\l,- 

Therefore, in both cases, there are constants C 13 and C 14 that depend 
only on q and L, and with probability at least 

1 - 2exp (-ci4iVr\22i) , 


sup 

h€Vj 




< Qs]/A. 


The claim follows by applying the union bound to this estimate for 0 < j < 

k- ■ 


Next, let us turn to the infimum of the quadratic process

$$\inf_{\{v \in V : \|v - v_0\|_{L_2} \ge cr\}} \frac{1}{N} \sum_{i=1}^N \Big( \frac{v - v_0}{\|v - v_0\|_{L_2}} \Big)^2 (X_i), \qquad (3.5)$$

where $r$ was selected in (3.3) for a well-chosen $\eta_2$ and where $c$ is a suitable
constant.

Lemma 3.3 For every $L \ge 1$ there exist constants $c_0$, $c_1$ and $c_2$ that depend
only on $L$ for which the following holds. For every $v_0 \in V$, with probability
at least $1 - 2\exp(-c_0 N)$, if $v \in V$ and $\|v - v_0\|_{L_2} \ge c_1 r$ then

$$\frac{1}{N} \sum_{i=1}^N (v - v_0)^2(X_i) \ge c_2 \|v - v_0\|_{L_2}^2.$$

The proof of Lemma 3.3 is similar to the one used in the analysis of the
multiplier process: controlling relatively ‘large distances’ in $F$, i.e., when
$f \in F$ for which $\|f - v_0\|_{L_2} \ge (3/2) r_Q(\eta_2) = s$; and then ‘small distances’
in $V$, that is, $v \in V$ for which $r \le \|v - v_0\|_{L_2} \le s$ (again, one may assume
that $r \le r_Q(\eta_2)$).

For the constant $\eta_2$ (yet to be specified), one has
• for every $2r \le t \le s$,

$$\log \mathcal{M}\big( (\mathrm{star}(V - v_0)) \cap sD, tD \big) \le c_0 \log(2s/t) \cdot \eta_2^2 N,$$

and

$$\log |(\mathrm{star}(V - v_0)) \cap sD| \le c_0 \log(2s/r) \cdot \eta_2^2 N.$$

The required lower bound on the infimum of the quadratic process (3.5)
is based on estimates from m and m, which will be formulated under the
subgaussian assumption, rather than using the original (and much weaker)
small-ball condition.

Theorem 3.4 For every $L \ge 1$ there are constants $\kappa_4$, $\kappa_5$ and $\kappa_6$ that
depend only on $L$ for which the following holds. Let $H$ be an $L$-subgaussian
class that is star-shaped around zero. Set $H_\rho = H \cap \rho D$ and fix $\rho$ for which

$$\mathbb{E}\|G\|_{H_\rho} \le \kappa_4 \sqrt{N} \rho.$$

Then, with probability at least $1 - 2\exp(-\kappa_5 N)$,

$$\inf_{\{h \in H, \ \|h\|_{L_2} \ge \rho\}} \frac{1}{N} \sum_{i=1}^N \Big( \frac{h}{\|h\|_{L_2}} \Big)^2 (X_i) \ge \kappa_6.$$

We will apply Theorem 3.4 to the class $H = (F - v_0) \cap sD$ (large
distances) and then to $V_j = \mathrm{star}((V - v_0) \cap s_j D)$ for $s_j = 2^j r$ (small
distances).

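Lemma 3.5 There exist absolute constants $c_0$ and $c_1$ for which the following
holds. For every $s_j \ge \rho \ge c_0 r$,

$$\mathbb{E}\|G\|_{V_j \cap \rho D} \le c_1 \eta_2 \sqrt{N} \big( \rho \log^{1/2}(2s_j/\rho) + r \log^{1/2}(2s_j/r) \big).$$

In particular, setting $\rho = s_j/2$ for $\eta_2 = c_2 \kappa_4$, one has

$$\mathbb{E}\|G\|_{V_j \cap (s_j/2)D} \le \kappa_4 \sqrt{N} (s_j/2).$$

Proof. Fix $\rho \le s_j$ and note that by Dudley’s entropy integral bound (see,
e.g., mm),

$$\mathbb{E}\|G\|_{V_j \cap \rho D} \le c_1 \int_0^\rho \log^{1/2} \mathcal{N}(V_j \cap \rho D, tD) \, dt
\le c_1 \int_0^r \log^{1/2} \mathcal{N}(V_j \cap \rho D, tD) \, dt + c_1 \int_r^\rho \log^{1/2} \mathcal{N}(V_j \cap \rho D, tD) \, dt.$$

Applying (3.1) and since

$$V_j = \mathrm{star}((V - v_0) \cap s_j D) \subset (\mathrm{star}(V - v_0)) \cap s_j D,$$

it follows that for $r \le t \le \rho$,

$$\begin{aligned}
\log \mathcal{N}(V_j \cap \rho D, rD) &\le \log \mathcal{M}\big( (\mathrm{star}(V - v_0)) \cap \rho D, rD \big) \\
&\le c_2 \log(2\rho/r) \cdot \sup_{x \in F} \log \mathcal{M}\big( (F - x) \cap 4rD, rD \big) \\
&\le c_2 \log(2\rho/r) \cdot \eta_2^2 N.
\end{aligned}$$

Moreover, by (3.2),

$$\log |(V - v_0) \cap s_j D| \le c_2 \log(2s_j/r) \cdot \eta_2^2 N = (*).$$

Hence, $V_j$ is the union of at most $\exp(*)$ ‘intervals’ of the form $[0, v - v_0]$,
and for $t < r$,

$$\log \mathcal{N}(V_j \cap \rho D, tD) \le c_2 \big( \eta_2^2 N \log(2s_j/r) + \log(2\rho/t) \big).$$

Now the first part of the claim follows from integration, and the second part
is an immediate outcome of the first. ■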

Proof of Lemma 3.3 Combining Theorem 3.4 and Lemma 3.5 for $\eta_2 =
c_0 \kappa_4$, it follows that with probability at least $1 - 2\exp(-\kappa_5 N)$, if $v \in V$ and
$s_j/2 \le \|v - v_0\|_{L_2} \le s_j$,

$$\frac{1}{N} \sum_{i=1}^N (v - v_0)^2(X_i) \ge \kappa_6 \|v - v_0\|_{L_2}^2. \qquad (3.6)$$

Repeating this argument for $s_j = 2^j r$ and then applying it to the set $F_{v_0} \cap sD$
for $s = (3/2) r_Q(\eta_2)$, it follows that if $\log_2(s/r) \le \exp(\kappa_5 N/2)$ then with
probability at least $1 - 2\exp(-\kappa_5 N/2)$, (3.6) holds for every $v \in V$ that
satisfies $\|v - v_0\|_{L_2} \ge c_1 r$. ■

With all the ingredients in place, we may now conclude the proof of the
upper estimate.

Fix $f_0 \in F$ and set $Y = f_0(X) + W$ for $W \in L_q$ that is orthogonal to
$\mathrm{span}(F)$. Let $r$, $V$ and $v_0$ be as above. Clearly, for every $v \in V$,

$$\|v - Y\|_{L_2}^2 = \|W\|_{L_2}^2 + \|v - f_0\|_{L_2}^2, \qquad (3.7)$$

and thus $\|v_0 - f_0\|_{L_2} \le r$. Moreover, for every $v \in V$, $\mathbb{E} W \cdot (v - v_0)(X) = 0$
and

$$|\mathbb{E}(v_0(X) - Y)(v - v_0)(X)| = |\mathbb{E}(v_0 - f_0)(X) \cdot (v - v_0)(X)|
\le \|v_0 - f_0\|_{L_2} \cdot \|v - v_0\|_{L_2} \le r \|v - v_0\|_{L_2}.$$
By Lemma 3.3, with probability at least $1 - 2\exp(-\kappa_5 N/2)$, if $v \in V$
and $\|v - v_0\|_{L_2} \ge c(L) r$, then

$$\frac{1}{N} \sum_{i=1}^N (v - v_0)^2(X_i) \ge \kappa_6 \|v - v_0\|_{L_2}^2.$$

Using the notation of Lemma 3.1, set $\theta = \kappa_6/8$ and $\eta_1 = c_0(q, L) \theta / \|W\|_{L_q}$.
Hence, there are constants $c_1$ and $c_2$ that depend only on $q$ and $L$, for which,
with probability at least

$$1 - c_1 \frac{\log^q N}{N^{(q/2)-1}} - 2\exp(-c_2 \eta_1^2 r^2 N),$$

for every $v \in V$, $\|v - v_0\|_{L_2} \ge 2r$,

$$\Big| \frac{1}{N} \sum_{i=1}^N (v_0(X_i) - Y_i)(v - v_0)(X_i) - \mathbb{E}(v_0(X) - Y)(v - v_0)(X) \Big| \le \frac{\kappa_6}{8} \|v - v_0\|_{L_2}^2.$$
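On the intersection of the two events and for a constant $c_3 = c_3(q, L)$, if
$\|v - v_0\|_{L_2} \ge c_3 r$ then

$$\begin{aligned}
P_N \mathcal{L}_v^V &= \frac{1}{N} \sum_{i=1}^N (v - v_0)^2(X_i) + \frac{2}{N} \sum_{i=1}^N (v_0(X_i) - Y_i)(v - v_0)(X_i) \\
&\ge \frac{1}{N} \sum_{i=1}^N (v - v_0)^2(X_i) - 2|\mathbb{E}(v_0(X) - Y)(v - v_0)(X)| \\
&\quad - 2\Big| \frac{1}{N} \sum_{i=1}^N (v_0(X_i) - Y_i)(v - v_0)(X_i) - \mathbb{E}(v_0(X) - Y)(v - v_0)(X) \Big| \\
&\ge \kappa_6 \|v - v_0\|_{L_2}^2 - 2r \|v - v_0\|_{L_2} - (\kappa_6/4) \|v - v_0\|_{L_2}^2 \ge (\kappa_6/4) \|v - v_0\|_{L_2}^2.
\end{aligned}$$

Thus, for every such sample, the empirical minimizer $\hat{v} \in V$ satisfies that

$$\|\hat{v} - v_0\|_{L_2} \le c_4 r.$$

And, since $W$ is orthogonal to $\mathrm{span}(F)$,

$$\begin{aligned}
\mathbb{E}\big( \mathcal{L}_{\hat{v}} \,|\, (X_i, Y_i)_{i=1}^N \big) &= \|\hat{v} - Y\|_{L_2}^2 - \|f_0 - Y\|_{L_2}^2 = \|\hat{v} - f_0 - W\|_{L_2}^2 - \|W\|_{L_2}^2 \\
&= \|\hat{v} - f_0\|_{L_2}^2 - 2\mathbb{E} W \cdot (\hat{v} - f_0)(X) \le \big( \|\hat{v} - v_0\|_{L_2} + \|v_0 - f_0\|_{L_2} \big)^2 \le (1 + c_4)^2 r^2.
\end{aligned}$$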


4 The lower bound 

The lower estimates presented below are based on a volumetric argument.
The idea is that if a learning procedure is ‘too successful’, a well-separated
subset of $F$ endows a well-separated subset in $\mathbb{R}^N$ (a set that depends on
$X_1, ..., X_N$). However, because of some volumetric constraint, there is not
‘enough room’ for such a separated set to exist, leading to a contradiction.

The notions of volume are different in the two estimates: one is based
on the Lebesgue measure while the other is determined by the choice of the
‘noise’ $W$, which is, in our case, gaussian.

Definition 4.1 Let $F$ be a class of functions and assume that $\mathbb{X} = (x_i)_{i=1}^N \in \Omega^N$. 
For every $f \in F$, set 

$$\mathcal{K}(f, \mathbb{X}) = \{h \in F : h(x_i) = f(x_i) \ \text{for every} \ 1 \le i \le N\}.$$

The set $\mathcal{K}(f, \mathbb{X})$ is called the version space of $F$ associated with $f$ and $\mathbb{X}$. 

In other words, $\mathcal{K}(f, \mathbb{X})$ consists of all the functions in $F$ that agree with 
$f$ on $\mathbb{X}$. Naturally, in the context of learning, $\mathbb{X}$ is a random sample $(X_i)_{i=1}^N$, 
selected according to the underlying measure $\mu$. 

The diameter of the version space is a reasonable choice for a lower 
bound on the performance of any learning procedure: if $Y_i = f(X_i) + W_i$, a 
learning procedure cannot distinguish between $f$ and any other function in 
the version space associated with $f$ and $(X_i)_{i=1}^N$. Hence, the largest typical 
diameter of a version space should be a lower estimate on the performance 
of any learning procedure, as the following well-known fact shows (see, e.g., 
[7]). 
Theorem 4.2 Given a random variable $W$, for every $f \in F$ set $Y^f = 
f(X) + W$. If $\Psi$ is a learning procedure, then 

supPr > 1/2, 

where the probability is relative to the product measure endowed on $(\Omega \times \mathbb{R})^N$ 
by the $N$-product of the joint distribution of $X$ and $W$. 

Clearly, if $W$ is orthogonal to span(F), then for every $h \in F$ and every 
target $Y^f$, $\mathbb{E}\mathcal{L}_h = \|h - f\|_{L_2}^2$. Thus, the largest typical diameter of a version 
space $\mathcal{K}(f, \mathbb{X})$ is a lower bound on $\mathcal{E}_p$ for the set of admissible targets $\mathcal{Y} = 
\{f(X) + W : f \in F\}$. 

This leads to the following question: 

Question 4.3 Given a class $F$ defined on a probability space $(\Omega, \mu)$, $f \in F$ 
and $\mathbb{X} = (x_1,...,x_N) \in \Omega^N$, find a lower estimate on 

$$\mathrm{diam}(\mathcal{K}(f,\mathbb{X}), L_2(\mu)).$$

One situation in which Question 4.3 is of independent interest is when 
$T \subset \mathbb{R}^n$ is a convex body (i.e., a convex, centrally-symmetric set with a 
nonempty interior) and $F = \{\langle t, \cdot \rangle : t \in T\}$ is the class of linear functionals 
associated with $T$. For every $x_1,...,x_N \in \mathbb{R}^n$ set $\mathbb{X} = (x_1,...,x_N)$, and let 
$\Gamma_{\mathbb{X}} = \sum_{i=1}^N \langle x_i, \cdot \rangle e_i$ be the matrix whose rows are $x_1,...,x_N$. Thus, 

$$\mathcal{K}(0,\mathbb{X}) = \ker(\Gamma_{\mathbb{X}}) \cap T.$$
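To see why the version space of the zero function takes this form, it may help to unwind the definitions. The short derivation below is only a sketch; its second identity additionally assumes that $\mu$ is isotropic (as in the next paragraph), and $|\cdot|$ denotes the Euclidean norm on $\mathbb{R}^n$.

```latex
% Membership in the version space of 0 for the class of linear functionals:
\[
h = \langle t, \cdot \rangle \in \mathcal{K}(0, \mathbb{X})
\iff t \in T \ \text{and} \ \langle t, x_i \rangle = 0 \ \text{for every} \ 1 \le i \le N
\iff t \in \ker(\Gamma_{\mathbb{X}}) \cap T .
\]
% If \mu is isotropic, i.e. \mathbb{E} \langle t, X \rangle^2 = |t|^2 for every t, then
\[
\mathrm{diam}\bigl( \mathcal{K}(0, \mathbb{X}), L_2(\mu) \bigr)
= \sup \bigl\{ |s - t| : s, t \in \ker(\Gamma_{\mathbb{X}}) \cap T \bigr\} .
\]
```

In the isotropic case, Question 4.3 thus becomes a question about the Euclidean diameter of a section of $T$ by a subspace of codimension at most $N$.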

If $\mu$ is an isotropic, $L$-subgaussian measure on $\mathbb{R}^n$, one may show that with 
probability at least $1 - 2\exp(-c_0 N)$, 

$$\mathrm{diam}(\mathcal{K}(0,\mathbb{X}), L_2(\mu)) \le 2r_Q(c_1(L)) \qquad (4.1)$$

(see mi). This extends the celebrated result of Pajor and Tomczak-Jaegermann 
[laiiB], that (4.1) holds for the Haar measure on $S^{n-1}$ (and thus, also for 
the gaussian measure on $\mathbb{R}^n$). 

It turns out that (4.1) is not far from optimal: 

Theorem 4.4 There exists an absolute constant $c$ for which the following 
holds. Let $F \subset L_2(\mu)$ be a convex and centrally-symmetric set. If 

$$\log \mathcal{M}(F \cap 2rD, (r/4)D) \ge cN,$$

then for every $\mathbb{X} = (x_1,...,x_N)$, 

$$\mathrm{diam}(\mathcal{K}(0, \mathbb{X}), L_2(\mu)) \ge r/8.$$
Since $F$ is convex and centrally-symmetric, $F - F = 2F$ and $0 \in F$. Therefore, 

$$\mathcal{M}(F \cap 4rD, (r/2)D) \le \sup_{x \in F} \mathcal{M}((F - x) \cap 4rD, (r/2)D) \le \mathcal{M}((F - F) \cap 4rD, (r/2)D) = \mathcal{M}(F \cap 2rD, (r/4)D).$$

Hence, Theorem 4.4 shows that if $\gamma_Q(c, 0) \ge r$ then for every $\mathbb{X} = (x_1,...,x_N)$, 
$\mathrm{diam}(\mathcal{K}(0,\mathbb{X}), L_2(\mu)) \ge r/8$. In particular, for every $W \in L_2$ that is orthogonal 
to span(F), the best possible error rate in $F$ that holds for every target 
$Y^f = f(X) + W$ is at least $\gamma_Q(c,0) \ge c_1\gamma_Q(c)$. 

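Proof. Let $f_1,...,f_m$ be $r/4$-separated in $F \cap 2rD$. Set 

$$A_i = \frac{f_i}{2} + \frac{1}{32}(F \cap 2rD),$$

and observe that $A_i \subset F \cap 2rD$. Also, for every $h \in A_i$, $\|(f_i/2) - h\|_{L_2} \le 
r/16$; therefore, if $h_i \in A_i$ and $h_\ell \in A_\ell$ then $\|h_i - h_\ell\|_{L_2} \ge r/8$. 

Fix $\mathbb{X} = (x_1,...,x_N)$ and for $A \subset F$ set 

$$P_{\mathbb{X}}(A) = \{(h(x_i))_{i=1}^N : h \in A\} \subset \mathbb{R}^N,$$

the coordinate projection of $A$ associated with $\mathbb{X}$. Clearly, for every $1 \le i \le m$, 

$$P_{\mathbb{X}}(A_i) = \frac{1}{2}(f_i(x_j))_{j=1}^N + \frac{1}{32}P_{\mathbb{X}}(F \cap 2rD). \qquad (4.2)$$

Consider two possibilities. First, if there are $i \neq \ell$ for which $P_{\mathbb{X}}(A_i) \cap 
P_{\mathbb{X}}(A_\ell) \neq \emptyset$, there are $h_i \in A_i$ and $h_\ell \in A_\ell$ that satisfy $h_i - h_\ell \in \mathcal{K}(0,\mathbb{X})$, 
thus showing that $\mathrm{diam}(\mathcal{K}(0, \mathbb{X}), L_2(\mu)) \ge r/8$. 

Otherwise, the sets $P_{\mathbb{X}}(A_i)$ are disjoint subsets of $P_{\mathbb{X}}(F \cap 2rD)$. And, 
setting $T = P_{\mathbb{X}}(F \cap 2rD)$, (4.2) implies that $\mathcal{M}(T, T/32) \ge m$. Since $T$ is a 
convex, centrally symmetric subset of $\mathbb{R}^N$, a standard volumetric argument 
shows that $\mathcal{M}(T, T/32) \le \exp(cN)$ for a suitable absolute constant $c$. Thus, 
if $m > \exp(cN)$, $\mathrm{diam}(\mathcal{K}(0, \mathbb{X}), L_2) \ge r/8$, as claimed. ■ 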
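The ‘standard volumetric argument’ invoked at the end of the proof can be sketched as follows. The sketch assumes that $T$ has a nonempty interior in $\mathbb{R}^N$ (otherwise one may argue in the span of $T$, whose dimension is at most $N$), and it uses the disjointness of the sets $P_{\mathbb{X}}(A_i) = z_i + T/32$, where $z_i = \frac{1}{2}(f_i(x_j))_{j=1}^N$ is shorthand introduced here. The constant $\log 32$ is what this particular comparison yields; it is not claimed to be the constant the author has in mind.

```latex
% Disjoint translates of T/32 inside T, compared by Lebesgue measure in R^N:
\[
\bigcup_{i=1}^m \Bigl( z_i + \tfrac{1}{32} T \Bigr) \subset T
\quad \text{(disjoint union)}
\]
\[
\Longrightarrow \quad
m \cdot \mathrm{vol}\bigl( \tfrac{1}{32} T \bigr) = m \cdot 32^{-N} \, \mathrm{vol}(T) \le \mathrm{vol}(T)
\quad \Longrightarrow \quad
m \le 32^{N} = \exp(N \log 32) .
\]
```

In particular, once $m > \exp(N \log 32)$ the translates cannot all be disjoint, which forces the first of the two possibilities in the proof.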

The final result of this section is the ‘noise-dependent’ lower bound. 

Theorem 4.5 There exist absolute constants $c_1$ and $c_2$ for which the following 
holds. Let $F \subset L_2(\mu)$ be a convex, centrally-symmetric class of functions, 
set $W$ to be a centred normal random variable that is independent of $X$, and 
for every $f \in F$, put $Y^f = f(X) + W$. If $\Psi$ is a learning procedure that 
performs with confidence of at least 7/8 for every $Y^f$, there is some $Y^f$ for 
which 

$$\mathcal{E}_p \ge c_1\gamma_M^2(c_2/\sigma),$$

where $\sigma^2$ denotes the variance of $W$. 

Stronger versions of Theorem 4.5 (without the assumption that $F$ is convex 
and centrally-symmetric) may be proved in several different ways: using 
information theoretic tools (see Theorem 2.5 in [17]), or, alternatively, by 
applying the gaussian isoperimetric inequality as in [7]. Both these arguments 
are rather restrictive, because they rely on rather special properties 
of the noise. 

Although the proof we present below is also for a gaussian noise, the 
argument is less restrictive and may be extended to other choices of noise 
(e.g. when $W$ is log-concave rather than gaussian). The argument is essentially 
the same as Talagrand’s proof of the dual-Sudakov inequality [8], and 
as such is volumetric in nature: obtaining a lower bound on the measure of 
a shift of a centrally-symmetric set in terms of the Euclidean norm of the 
shift. 


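Lemma 4.6 Let $A \subset \mathbb{R}^N$ be centrally symmetric and set $z \in \mathbb{R}^N$. If $\nu$ is 
the centred gaussian measure on $\mathbb{R}^N$ with covariance $\sigma^2 I_N$ and $|\cdot|$ denotes 
the Euclidean norm on $\mathbb{R}^N$, then 

$$\nu(z + A) \ge \exp\left(-\frac{|z|^2}{2\sigma^2}\right)\nu(A).$$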


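Proof. A change of variables shows that 

$$
\begin{aligned}
\nu(z + A) &= \frac{1}{(2\pi\sigma^2)^{N/2}}\int_{z+A} \exp\left(-\frac{|x|^2}{2\sigma^2}\right)dx = \frac{1}{(2\pi\sigma^2)^{N/2}}\int_{A} \exp\left(-\frac{|t + z|^2}{2\sigma^2}\right)dt \\
&= \exp\left(-\frac{|z|^2}{2\sigma^2}\right)\frac{1}{(2\pi\sigma^2)^{N/2}}\int_{A} \exp\left(-\frac{\langle z,t \rangle}{\sigma^2}\right)\exp\left(-\frac{|t|^2}{2\sigma^2}\right)dt = (*).
\end{aligned}
$$

Let $\mathbb{E}_{\nu|A}$ be the expectation with respect to the gaussian measure $\nu$, conditioned 
on $A$. Thus, 

$$(*) = \exp\left(-\frac{|z|^2}{2\sigma^2}\right)\nu(A) \cdot \mathbb{E}_{\nu|A}\exp\left(-\frac{\langle z,t \rangle}{\sigma^2}\right).$$

Since $A$ is symmetric, $\mathbb{E}_{\nu|A}\langle z,t \rangle = 0$, and by Jensen’s inequality 

$$(*) \ge \exp\left(-\frac{|z|^2}{2\sigma^2}\right)\nu(A).$$

■ 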
Proof of Theorem 4.5. Let $\Psi$ be a learning procedure that performs with 
accuracy $\mathcal{E}_p$ for every target $Y^f = f(X) + W$, for $f \in F$ and $W \sim \mathcal{N}(0,\sigma^2)$ 
that is independent of $X$. Note that for the target $Y^f$ the true minimizer 
in $F$ is $f^* = f$ and for every $h \in F$, 

$$\mathbb{E}\mathcal{L}_h = \mathbb{E}(h(X) - Y^f)^2 - \mathbb{E}(f^*(X) - Y^f)^2 = \|h - f\|_{L_2}^2.$$

Thus, if $\tau = (x_i,y_i)_{i=1}^N \in (\Omega \times \mathbb{R})^N$ is a sample on which $\Psi$ performs with 
accuracy $\mathcal{E}_p$ relative to the target $Y^f$, then $\|\Psi(\tau) - f^*\|_{L_2}^2 \le \mathcal{E}_p$. 

Let $(f_j)_{j=1}^m$ be a subset of $F \cap 4rD$ that is $r/2$ separated in $L_2(\mu)$ for 
$(r/2)^2 = 9\mathcal{E}_p$ and fix $\mathbb{X} = (x_1,...,x_N) \in \Omega^N$. 

For every $1 \le j \le m$, put 

$$A_j(\mathbb{X}) = \left\{(w_i)_{i=1}^N : \Psi\left((x_i, f_j(x_i) + w_i)_{i=1}^N\right) \in f_j + (r/6)D\right\},$$

i.e., $A_j(\mathbb{X})$ consists of all the vectors $(w_i)_{i=1}^N \in \mathbb{R}^N$ for which, upon receiving 
the data $(x_i, f_j(x_i) + w_i)_{i=1}^N$, $\Psi$ selects a point whose $L_2$ distance to $f_j$ is at 
most $r/6 = \mathcal{E}_p^{1/2}$. 

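Let $\nu$ be the centred gaussian measure on $\mathbb{R}^N$ with covariance $\sigma^2 I_N$. 
Since $W$ is a centred gaussian random variable with variance $\sigma^2$, $(W_i)_{i=1}^N$ is 
distributed according to $\nu$, and since it is independent of $X$, if $\Psi$ performs 
with accuracy $\mathcal{E}_p$ and with probability at least 7/8, it is evident that 

$$
\begin{aligned}
& \mu^N \otimes \nu\left(\left\{(x_i,w_i)_{i=1}^N : \Psi\left((x_i, f_j(x_i) + w_i)_{i=1}^N\right) \in f_j + (r/6)D\right\}\right) \\
&\quad = \mu^N \otimes \nu\left(\left\{(x_i,w_i)_{i=1}^N : (w_i)_{i=1}^N \in A_j(\mathbb{X})\right\}\right) \ge 7/8.
\end{aligned}
$$

A standard Fubini argument shows that there is an event $\mathcal{C}_j \subset \Omega^N$ of 
probability at least 1/2, and for every $\mathbb{X} = (x_i)_{i=1}^N \in \mathcal{C}_j$, $\nu(A_j(\mathbb{X})) \ge 3/4$. 
Observe that if $\mathbb{X} \in \mathcal{C}_j$ then by the symmetry of $\nu$, $\nu(-A_j(\mathbb{X})) \ge 3/4$, and 
the centrally-symmetric set $A_j(\mathbb{X}) \cap -A_j(\mathbb{X}) \subset A_j(\mathbb{X})$ satisfies that 

$$\nu(A_j(\mathbb{X}) \cap -A_j(\mathbb{X})) \ge 1/2.$$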
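Both steps in the preceding paragraph are short measure computations. The sketch below spells them out, writing $\phi(\mathbb{X}) = \nu(A_j(\mathbb{X}))$ as a shorthand that does not appear in the text. It assumes only that $0 \le \phi \le 1$ and that $\mathbb{E}_{\mathbb{X}}\phi(\mathbb{X}) \ge 7/8$, which is what the displayed inequality gives after applying Fubini's theorem.

```latex
% Fubini step: Markov's inequality applied to 1 - \phi \ge 0.
\[
Pr\bigl( \phi(\mathbb{X}) < 3/4 \bigr)
= Pr\bigl( 1 - \phi(\mathbb{X}) > 1/4 \bigr)
\le \frac{\mathbb{E}\bigl( 1 - \phi(\mathbb{X}) \bigr)}{1/4}
\le \frac{1/8}{1/4} = \frac{1}{2} ,
\]
% so the event \{ \phi(\mathbb{X}) \ge 3/4 \} has probability at least 1/2.

% Symmetrization step: union bound on complements, for \mathbb{X} in that event.
\[
\nu\bigl( A_j(\mathbb{X}) \cap -A_j(\mathbb{X}) \bigr)
\ge 1 - \nu\bigl( A_j(\mathbb{X})^c \bigr) - \nu\bigl( (-A_j(\mathbb{X}))^c \bigr)
\ge 1 - \tfrac{1}{4} - \tfrac{1}{4} = \tfrac{1}{2} .
\]
```

The second step uses only the symmetry of $\nu$, which is why the argument is not tied to the gaussian case.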

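Let $z_j = (f_j(x_i))_{i=1}^N$. If $\mathbb{X} \in \mathcal{C}_j \cap \mathcal{C}_\ell$, the sets $z_j + A_j(\mathbb{X})$ and $z_\ell + A_\ell(\mathbb{X})$ 
are disjoint, because $\Psi$ maps $z_j + A_j(\mathbb{X})$ to an $r/6$-neighbourhood of $f_j$ and 
$z_\ell + A_\ell(\mathbb{X})$ to an $r/6$-neighbourhood of $f_\ell$ - but $\|f_j - f_\ell\|_{L_2} \ge r/2$. Therefore 

$$\sum_{j=1}^m 1_{\mathcal{C}_j}(\mathbb{X})\,\nu\bigl(z_j + (A_j(\mathbb{X}) \cap -A_j(\mathbb{X}))\bigr) \le 1;$$

integrating with respect to $\mathbb{X}$, 

$$\sum_{j=1}^m \mathbb{E}_{\mathbb{X}} 1_{\mathcal{C}_j}(\mathbb{X})\,\nu\bigl(z_j + (A_j(\mathbb{X}) \cap -A_j(\mathbb{X}))\bigr) \le 1,$$

and all that remains is to control $\mathbb{E}_{\mathbb{X}} 1_{\mathcal{C}_j}(\mathbb{X})\,\nu\bigl(z_j + (A_j(\mathbb{X}) \cap -A_j(\mathbb{X}))\bigr)$ from 
below. 

Applying Lemma 4.6, 

$$
\begin{aligned}
\nu\bigl(z_j + (A_j(\mathbb{X}) \cap -A_j(\mathbb{X}))\bigr) &\ge \exp\left(-\frac{|z_j|^2}{2\sigma^2}\right)\nu\bigl(A_j(\mathbb{X}) \cap -A_j(\mathbb{X})\bigr) \\
&= \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^N f_j^2(x_i)\right)\nu\bigl(A_j(\mathbb{X}) \cap -A_j(\mathbb{X})\bigr).
\end{aligned}
$$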


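By Chebychev’s inequality and recalling that $\|f_j\|_{L_2} \le 4r$, 

$$Pr\left(\sum_{i=1}^N f_j^2(X_i) \le c_0 N r^2\right) \ge 3/4$$

for an appropriate choice of an absolute constant $c_0$ and for every $1 \le j \le m$. 
Thus, on an event of measure at least 1/4, $\mathbb{X} \in \mathcal{C}_j$, $\nu(A_j(\mathbb{X}) \cap -A_j(\mathbb{X})) \ge 
1/2$ and $\sum_{i=1}^N f_j^2(X_i) \le c_0 N r^2$; therefore, 

$$\mathbb{E}_{\mathbb{X}} 1_{\mathcal{C}_j}(\mathbb{X})\,\nu\bigl(z_j + (A_j(\mathbb{X}) \cap -A_j(\mathbb{X}))\bigr) \ge \frac{1}{8}\exp\left(-\frac{c_0 N r^2}{2\sigma^2}\right).$$

Hence, $\log m \le c_1 N r^2/\sigma^2$, i.e., $\log \mathcal{M}(F \cap 4rD, (r/2)D) \le (c_2/\sigma)^2 N r^2$, 
implying that $\mathcal{E}_p \ge c_3\gamma_M^2(c_2/\sigma)$. 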


5 Some Remarks 

We begin this section with an example of ‘natural’ sets, for which there is a 
true gap between the two sets of parameters: $r_Q/r_M$ and $\gamma_Q/\gamma_M$. 

Let $T \subset \mathbb{R}^n$ be a convex body in $\mathbb{R}^n$ (i.e., a convex, centrally-symmetric 
set with a nonempty interior), put $F = \{\langle t, \cdot \rangle : t \in T\}$, the class of linear 
functionals associated with $T$, and set $\mu$ to be the gaussian measure on $\mathbb{R}^n$. 

It is straightforward to verify that for every $r > 0$, $(F \cap rD, L_2(\mu))$ is 
isometric to $(T \cap rB_2^n, \ell_2^n)$, where $B_2^n$ is the Euclidean unit ball in $\mathbb{R}^n$. Let 
$1 \le p < 2$, and set $T = B_p^n$, the unit ball in $\ell_p^n = (\mathbb{R}^n, \|\cdot\|_{\ell_p^n})$. One may 
show (see 0) that when $p = 1$, $r_M$ and $\gamma_M$ are equivalent, as are $r_Q$ and 
$\gamma_Q$. However, such an equivalence is no longer true for $1 < p < 2$ (of course, 
as long as $p \ge 1 + 1/\log n$ - otherwise, $\ell_p^n$ is equivalent to $\ell_1^n$). 

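To see how that gap between the ‘global’ and ‘local’ parameters is exhibited 
in $B_p^n$ for $1 < p < 2$, let $x = (x_i)_{i=1}^n \in B_p^n$ and set $(x_i^*)_{i=1}^n$ to be the 
non-increasing rearrangement of $(|x_i|)_{i=1}^n$; thus, $x_i^* \le i^{-1/p}$. Recall the well 
known fact (see, e.g., 0), that $\mathbb{E}\|G\|_{B_p^n \cap rB_2^n}$ is equivalent to 

$$
\begin{cases}
n^{1-1/p} & \text{if } r \ge c_2(p)n^{-(1/p-1/2)}, \\
r\sqrt{n} & \text{if } r \le c_2(p)n^{-(1/p-1/2)}.
\end{cases}
$$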

Thus, if 

tm ~ IN^I^ > C2(p)n-(Vp-i/2), 

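Let us consider the case in which $1 > r \gg c_2(p)n^{-(1/p-1/2)}$. Set $\ell = 
(1/r)^{2p/(2-p)}$ and observe that 

$$B_p^n \cap 4rB_2^n \subset \left\{x \in \mathbb{R}^n : x_i^* \le 4r/\sqrt{i} \ \text{if } i \le \ell, \ \text{and } x_i^* \le i^{-1/p} \ \text{if } i > \ell\right\}.$$

Clearly, for a well-chosen constant $c_3$ one has $\sum_{i > c_3\ell} i^{-2/p} \le r^2/100$, and 

$$B_p^n \cap 4rB_2^n \subset \bigcup_{|I| = c_3\ell}\left(c_4 r B_{2,\infty}^I + B_{p,\infty}^{I^c}\right),$$

where $B_{q,\infty}^I$ is the unit ball in $\mathbb{R}^I$ endowed with the weak $\ell_q$ norm $\|\cdot\|_{\ell_{q,\infty}}$ (see the 
footnote below). In particular, if $|I| = c_3\ell$, the vectors of $B_p^n \cap 4rB_2^n$ restricted to $I^c$ 
belong to $(r/10)B_2^{I^c}$, and the impact of those ‘small’ 
coordinates on Euclidean distances is negligible: 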



Hence, separation at scale $r$ occurs only because of the largest $\sim \ell$ coordinates 
of the vectors involved. 

On the other hand, the contribution of those ‘large’ coordinates to $\mathbb{E}\|G\|_{B_p^n \cap 4rB_2^n}$ 
is equally negligible. Indeed, if $T = \bigcup_{|I| = m} a r B_2^I$ for some $m \le n/2$ and 
$a \ge 1$, it is standard to verify that 

$$\mathbb{E}\sup_{t \in T}\sum_{i=1}^n g_i t_i \le c_5\, a r m^{1/2} \cdot \log^{1/2}\left(\frac{en}{m}\right) = (*).$$

Footnote: Recall that for $x \in \mathbb{R}^n$, $\|x\|_{\ell_{q,\infty}} \le A$ if and only if $\sup_{j \ge 1} j^{1/q}x_j^* \le A$. 

If $m = c_3\ell = c_3(1/r)^{2p/(2-p)}$ and since $r \gg c_2(p)n^{-(1/p-1/2)}$, it follows that 

$$(*) \ll n^{1-1/p} \sim \mathbb{E}\|G\|_{B_p^n \cap rB_2^n}.$$

Thus, as long as $r$ is significantly larger than $c_2(p)n^{-(1/p-1/2)}$, the gaussian 
average of the intersection body $B_p^n \cap 4rB_2^n$ originates from the ‘small 
coordinates’ in the monotone rearrangement, and in particular, from vectors 
whose Euclidean norm is significantly smaller than $r$. Such vectors are 
‘invisible’ to $\gamma_M$, which is why $\gamma_M$ is much smaller than $r_M$. 

5.1 The role of fixed points 

Fixed points are encountered frequently in Empirical Processes and Statistics 
literature, and almost always with the same goal: obtaining ‘relative’ 
upper bounds on various empirical processes. To obtain such bounds, one 
has to compare the oscillation (i.e., the behaviour of the process indexed by 
$(F - F) \cap rD$) with some function of $r$. 

One usually obtains upper bounds on the oscillation via a symmetrization 
argument, leading to a sample-dependent Bernoulli process. Thus, the 
standard outcome is a fixed point equation, linking an entropy integral relative 
to the random $L_2$ metric generated by the sample $X_1,...,X_N$ with 
the desired function of $r$ (see [18] for numerous examples). 

Still within the realm of entropy integrals, it is possible to impose additional 
structure on the problem, which allows one to replace the empirical $L_2$ 
(random) metrics with the global $L_2(\mu)$ metric. For example, a fixed point 
equation with the same normalization as $r_M$ may be found in [2], where the 
setup allows the transition between the random metric and the deterministic 
one - but the ‘philosophy’ of the proof is the same: it is based on an entropy 
integral. 

Since the entropy integral is only an upper estimate on the supremum of the 
empirical process in question, regardless of the underlying assumptions, it is 
often loose. Therefore, one would like to find a general argument bypassing 
the whole mechanism of entropy integrals. 

As a first step, and because it is natural to expect that the empirical 
processes in question converge to a gaussian limit, one may try a ‘gaussian’-based 
fixed point, which relies on $\mathbb{E}\|G\|_{(F-F) \cap rD}$ rather than on an entropy 
integral bound. And, indeed, under a subgaussian assumption, the results 
of [7] lead to the gaussian-based $r_M$ and $r_Q$. 

Our results show that $r_M$ and $r_Q$ are not the end of the story and can 
be improved - at least for the special learning problems we consider. The 
‘right’ fixed points should involve the smaller local entropy estimates rather 
than the oscillation of the gaussian process. 


One fixed point that seems closer in nature to $\gamma_M$ than to $r_M$ may 
be found in the celebrated work of Yang and Barron [19], though a closer 
inspection shows that this impression is inaccurate. 

Comparing [19] to our results is somewhat unnatural because the setup in 
[19] is completely different: a function class consisting of uniformly bounded 
functions and an independent gaussian noise, both of which are crucial to 
the proof (see Section 3.2 in [19]). Also, the upper estimate is an existence 
result of a ‘good’ procedure - rather than a specific choice of a procedure; 
the estimate holds in expectation and not with high probability; and it does 
not tend to zero with the ‘noise level’ of the problem. 

All these differences are significant, but are still not a conclusive indication 
that the nature of the complexity parameter in [19] is different from 
ours. That indication is the key to the results in [19]: the assumption that 
the underlying class is ‘large’ - in the sense that 

$$\liminf_{\varepsilon \to 0}\frac{\log \mathcal{M}(F,(\varepsilon/2)D)}{\log \mathcal{M}(F,\varepsilon D)} > 1. \qquad (5.1)$$

One should note that this assumption immediately excludes all the modern 
high-dimensional problems, involving classes indexed by subsets of $\mathbb{R}^n$. Indeed, 
for any convex subset of $\mathbb{R}^n$, the liminf above is 1 rather than strictly 
greater than 1. 


Equation (5.1) has two significant implications: 

• The $r/2$ log-covering numbers of $F$ and of $F \cap rD$ are equivalent, which 
means that one may replace the local sets $F \cap rD$ with $F$ in the 
definition of the fixed points. This makes the proof of the upper bound 
simpler. 


• It essentially restricts the setup to classes that have polynomial entropy, 
which is a considerably narrower scenario. Indeed, for the sake of 
brevity let us ignore cases in which 

$$\limsup_{\varepsilon \to 0}\frac{\log \mathcal{M}(F,(\varepsilon/2)D)}{\log \mathcal{M}(F,\varepsilon D)} = L \ge 4$$

(if $L > 4$ then the gaussian process $\{G_f : f \in F\}$ is not bounded and 
the class $F$ is not subgaussian, while if $L = 4$ an entropy estimate is 
not enough to determine whether the gaussian process is bounded and 
thus requires a more subtle analysis). When $L < 4$ and because the 
entropy of the ‘local’ set $F \cap rD$ is equivalent to the entropy of $F$, it 
follows that there are $0 < q_1 < q_2 < 2$ for which, for every $\varepsilon < R$ small 
enough, 

$$2^{q_1} \le \frac{\log \mathcal{M}(F \cap RD,(\varepsilon/2)D)}{\log \mathcal{M}(F \cap RD,\varepsilon D)} \le 2^{q_2}.$$

Using Dudley’s entropy integral for the upper bound and Sudakov’s 
minoration for the lower one, it is straightforward to verify that 

$$\mathbb{E}\|G\|_{F \cap RD} \sim R\log^{1/2}\mathcal{M}(F \cap RD, (R/2)D).$$

Therefore, the ‘global’ parameters $r_M$ and $r_Q$ are equivalent to the 
local ones $\gamma_M$ and $\gamma_Q$; in fact, the ‘local’ and ‘global’ parameters are 
even equivalent to the ones defined via the entropy integral. Thus, 
the typical situation in [19] is very different from the problems studied 
here - mainly because of (5.1). 